Although we have multiple ways to understand and group our providers (i.e. through line of business, size, etc.), we still do not have a full understanding of how to segment our providers. In an environment where our providers feel like we do not know who they are and how to best serve their needs, it is imperative to gain more intuition into the true provider personas. As a healthcare insurance company, creating these provider segments allows us to understand a provider group’s common concerns, background information, etc. and grants us the ability to interact with individual providers with unique insights into their issues. This analysis directly adds value to the company because we are then able to positively influence Net Promoter Scores (NPS) and develop a trusting relationship between us and our provider partners. As an outcome, this analysis could be used in call centers so our agents can tailor their conversation to the provider persona. Furthermore, this analysis could be used within our Advocate Network to help facilitate stronger communication and empathy.
What provider personas can be formed through unsupervised clustering methods?
Based on the data and completing exploratory data analysis, below are expected features that may determine provider clusters.
* Providers accepting Medicare assignments
* Providers using electronic health records
* Providers with a secondary specialty
* Providers with a hospital affiliation
* Providers in a hospital network
For this analysis, open source provider data was used from the Centers for Medicare & Medicaid Services. Please vist here for the most up-to-date source. This data was used because it has robust documentation, available SMEs, and current data refreshes. Furthermore, the data was fairly clean and was in a format that was easy to complete feature engineering. The total file contains 2.67 million rows; however, due to issues with high performance computing using R on a local machine, I opted to compute a simple random sample to get 5% of the total data, which amounted to ~133 thousand observations. This dataset contains demographic and Medicare quality program participation information for individual eligible professionals.
Within this analysis, I opted to include all providers taken from the SRS of 5% of the total CMS dataset because these were all active and eligible providers. Furthermore, every provider in this dataset was represented as a single observation, and only individual providers (not organizations) were represented as a row. In addition, one additional constraint to this dataset is that it was taken from the CMS website, which indicates that these providers have had some affiliation with Medicare and Medicaid. Therefore, providers not in our cohort are those that do not have any past affiliation with CMS.
# Eventually I would like to directly source the information from the CMS website and cache it
# setCacheDir(tempdir())
#
# simpleCache('provider.data', {provider <- read.socrata("https://data.medicare.gov/resource/c8qv-268j.csv")})
provider <- read_csv('Physician_Compare_National_Downloadable_File.csv',
col_types = paste(rep('c',41), sep = '',collapse = ''))
Sample data to make data easier to work with initially. Get 5% of total dataset.
provider <- sample_n(provider,
size = ceiling(.05*nrow(provider)),
replace = TRUE)
Make valid names for columns, i.e. remove spaces in column names
names(provider) <- make.names(names(provider))
Change to numerical variables
# provider$NPI <- as.factor(provider$NPI)
provider$years.after.grad <- as.year(Sys.Date()) - as.year(provider$Graduation.year)
provider$Number.of.Group.Practice.members <- as.numeric(provider$Number.of.Group.Practice.members)
Check the provider dataframe
str(provider)
## Classes 'tbl_df', 'tbl' and 'data.frame': 133061 obs. of 42 variables:
## $ NPI : chr "1730519752" "1528148947" "1669788949" "1508113853" ...
## $ PAC.ID : chr "7416184403" "2264629666" "0749463453" "3678715455" ...
## $ Professional.Enrollment.ID : chr "I20131219000824" "I20101210000903" "I20110317000580" "I20130807000177" ...
## $ Last.Name : chr "STONE" "PENMETSA" "GRAY" "PLESSNER" ...
## $ First.Name : chr "ROBYN" "SANTHI" "TERRANCE" "MELISSA" ...
## $ Middle.Name : chr NA NA "WAYNE" "R" ...
## $ Suffix : chr NA NA NA NA ...
## $ Gender : chr "F" "F" "M" "F" ...
## $ Credential : chr NA NA NA NA ...
## $ Medical.school.name : chr "OTHER" "OTHER" "OTHER" "NEW YORK COLLEGE OF OSTEO MEDICINE OF NEW YORK INSTITUTE OF TECHNOLOGY" ...
## $ Graduation.year : chr "2013" "1993" "2010" "2010" ...
## $ Primary.specialty : chr "PHYSICAL THERAPY" "RHEUMATOLOGY" "CERTIFIED REGISTERED NURSE ANESTHETIST" "HOSPITALIST" ...
## $ Secondary.specialty.1 : chr NA "INTERNAL MEDICINE" NA "INTERNAL MEDICINE" ...
## $ Secondary.specialty.2 : chr NA NA NA NA ...
## $ Secondary.specialty.3 : chr NA NA NA NA ...
## $ Secondary.specialty.4 : chr NA NA NA NA ...
## $ All.secondary.specialties : chr NA "INTERNAL MEDICINE" NA "INTERNAL MEDICINE" ...
## $ Organization.legal.name : chr "MILTON CHIROPRACTIC AND REHABILITATION INC" "ROCKDALE BLACKHAWK LLC" "ST JOHN HOSPITAL AND MEDICAL CENTER" "UT PHYSICIANS" ...
## $ Group.Practice.PAC.ID : chr "0941113708" "2769487115" "3173424082" "8426960360" ...
## $ Number.of.Group.Practice.members : num 41 86 272 NA 146 6 118 38 35 460 ...
## $ Line.1.Street.Address : chr "111 WILLARD ST" "611 W HWY 6 101" "22101 MOROSS RD" "1333 MOURSUND ST" ...
## $ Line.2.Street.Address : chr "SUITE GA" NA NA "MEMORIAL HERMANN TIRR" ...
## $ Marker.of.address.line.2.suppression : chr NA NA NA NA ...
## $ City : chr "QUINCY" "WACO" "DETROIT" "HOUSTON" ...
## $ State : chr "MA" "TX" "MI" "TX" ...
## $ Zip.Code : chr "021691200" "767107545" "482362148" "770303405" ...
## $ Phone.Number : chr "6174714491" "2547554582" "3133434000" "7137975929" ...
## $ Hospital.affiliation.CCN.1 : chr NA "451357" "230165" "450068" ...
## $ Hospital.affiliation.LBN.1 : chr NA "LITTLE RIVER HEALTHCARE" "ST JOHN HOSPITAL AND MEDICAL CENTER" "MEMORIAL HERMANN TEXAS MEDICAL CENTER" ...
## $ Hospital.affiliation.CCN.2 : chr NA "451385" "230216" NA ...
## $ Hospital.affiliation.LBN.2 : chr NA "GOODALL WITCHER HOSPITAL" "MCLAREN PORT HURON" NA ...
## $ Hospital.affiliation.CCN.3 : chr NA NA NA NA ...
## $ Hospital.affiliation.LBN.3 : chr NA NA NA NA ...
## $ Hospital.affiliation.CCN.4 : chr NA NA NA NA ...
## $ Hospital.affiliation.LBN.4 : chr NA NA NA NA ...
## $ Hospital.affiliation.CCN.5 : chr NA NA NA NA ...
## $ Hospital.affiliation.LBN.5 : chr NA NA NA NA ...
## $ Professional.accepts.Medicare.Assignment : chr "Y" "Y" "Y" "Y" ...
## $ Reported.Quality.Measures : chr "Y" "Y" NA "Y" ...
## $ Used.electronic.health.records : chr NA "Y" NA NA ...
## $ Committed.to.heart.health.through.the.Million.Hearts..initiative.: chr NA NA NA NA ...
## $ years.after.grad : num 5 25 8 8 29 35 22 23 30 19 ...
Number of observations in this dataset: 133061
Number of features in this dataset: 42
Convert multiple character columns to factors
for (i in names(provider)) {
if (class(provider[[i]]) == "character") {
provider[[i]] <- factor(provider[[i]])
}
}
Give top 10 counts of all columns and summary of numerical data
for (i in names(provider)) {
cat("\n")
if (class(provider[[i]]) == "numeric") {
print(i)
print(summary(provider[[i]]))
} else {
provider %>%
group_by_(.dots = i) %>%
dplyr::summarize(num_provider = n()) %>%
arrange(desc(num_provider)) %>%
head(n = 10) %>%
print()
}
}
##
## # A tibble: 10 x 2
## NPI num_provider
## <fct> <int>
## 1 1275731010 15
## 2 1578595815 10
## 3 1285608661 9
## 4 1104904655 8
## 5 1265408579 8
## 6 1760812796 8
## 7 1336359462 7
## 8 1720342355 7
## 9 1033256771 6
## 10 1063507713 6
##
## # A tibble: 10 x 2
## PAC.ID num_provider
## <fct> <int>
## 1 7416128319 15
## 2 1153301494 10
## 3 5294802104 9
## 4 1557599982 8
## 5 2567367121 8
## 6 6800883992 8
## 7 3870740269 7
## 8 5496849440 7
## 9 0244429884 6
## 10 0547332686 6
##
## # A tibble: 10 x 2
## Professional.Enrollment.ID num_provider
## <fct> <int>
## 1 I20110926000873 15
## 2 I20040726000703 10
## 3 I20090320000509 9
## 4 I20031205000385 8
## 5 I20070910000738 8
## 6 I20140110000957 8
## 7 I20120816000932 7
## 8 I20120820000599 7
## 9 I20031111000371 6
## 10 I20040601000630 6
##
## # A tibble: 10 x 2
## Last.Name num_provider
## <fct> <int>
## 1 SMITH 661
## 2 PATEL 595
## 3 JOHNSON 556
## 4 LEE 497
## 5 MILLER 490
## 6 BROWN 363
## 7 WILLIAMS 355
## 8 JONES 334
## 9 KIM 314
## 10 ANDERSON 295
##
## # A tibble: 10 x 2
## First.Name num_provider
## <fct> <int>
## 1 MICHAEL 2746
## 2 DAVID 2474
## 3 JOHN 2378
## 4 ROBERT 1936
## 5 JAMES 1744
## 6 WILLIAM 1364
## 7 MARK 1303
## 8 JENNIFER 1218
## 9 RICHARD 1205
## 10 THOMAS 1190
##
## # A tibble: 10 x 2
## Middle.Name num_provider
## <fct> <int>
## 1 <NA> 32657
## 2 A 10115
## 3 M 9830
## 4 J 7442
## 5 L 7087
## 6 R 5292
## 7 S 4938
## 8 E 4876
## 9 C 4429
## 10 D 4333
##
## # A tibble: 9 x 2
## Suffix num_provider
## <fct> <int>
## 1 <NA> 130577
## 2 JR. 1520
## 3 III 551
## 4 II 257
## 5 IV 86
## 6 SR. 54
## 7 I 12
## 8 IX 2
## 9 V 2
##
## # A tibble: 2 x 2
## Gender num_provider
## <fct> <int>
## 1 M 75631
## 2 F 57430
##
## # A tibble: 10 x 2
## Credential num_provider
## <fct> <int>
## 1 <NA> 88139
## 2 MD 31562
## 3 PA 2766
## 4 NP 2023
## 5 DO 1866
## 6 CNA 1172
## 7 DC 1030
## 8 PT 995
## 9 OD 955
## 10 CSW 711
##
## # A tibble: 10 x 2
## Medical.school.name num_provider
## <fct> <int>
## 1 OTHER 65305
## 2 INDIANA UNIVERSITY SCHOOL OF MEDICINE 1170
## 3 WAYNE STATE UNIVERSITY SCHOOL OF MEDICINE 1135
## 4 UNIVERSITY OF MINNESOTA MEDICAL SCHOOL 919
## 5 OHIO STATE UNIVERSITY COLLEGE OF MEDICINE 864
## 6 UNIVERSITY OF ILLINOIS AT CHICAGO HEALTH SCIENCE CENTER 861
## 7 UNIVERSITY OF MICHIGAN MEDICAL SCHOOL 853
## 8 TEMPLE UNIVERSITY SCHOOL OF MEDICINE 849
## 9 PHILADELPHIA COLLEGE OF OSTEOPATHIC MEDICINE 824
## 10 JEFFERSON MEDICAL COLLEGE OF THOMAS JEFFERSON UNIVERSITY 820
##
## # A tibble: 10 x 2
## Graduation.year num_provider
## <fct> <int>
## 1 2010 4162
## 2 2009 4138
## 3 2008 4115
## 4 2007 4100
## 5 2011 3951
## 6 2002 3924
## 7 2003 3906
## 8 2001 3903
## 9 2004 3898
## 10 2000 3850
##
## # A tibble: 10 x 2
## Primary.specialty num_provider
## <fct> <int>
## 1 NURSE PRACTITIONER 12674
## 2 INTERNAL MEDICINE 11366
## 3 PHYSICIAN ASSISTANT 11067
## 4 FAMILY PRACTICE 10057
## 5 DIAGNOSTIC RADIOLOGY 8560
## 6 PHYSICAL THERAPY 5271
## 7 CERTIFIED REGISTERED NURSE ANESTHETIST 4308
## 8 CARDIOVASCULAR DISEASE (CARDIOLOGY) 4231
## 9 ANESTHESIOLOGY 4124
## 10 OBSTETRICS/GYNECOLOGY 3924
##
## # A tibble: 10 x 2
## Secondary.specialty.1 num_provider
## <fct> <int>
## 1 <NA> 113925
## 2 INTERNAL MEDICINE 7063
## 3 CARDIOVASCULAR DISEASE (CARDIOLOGY) 1385
## 4 CRITICAL CARE (INTENSIVISTS) 1264
## 5 PEDIATRIC MEDICINE 788
## 6 GENERAL SURGERY 620
## 7 FAMILY PRACTICE 508
## 8 GERIATRIC MEDICINE 424
## 9 INTERVENTIONAL RADIOLOGY 414
## 10 EMERGENCY MEDICINE 405
##
## # A tibble: 10 x 2
## Secondary.specialty.2 num_provider
## <fct> <int>
## 1 <NA> 130957
## 2 INTERNAL MEDICINE 949
## 3 PULMONARY DISEASE 168
## 4 MEDICAL ONCOLOGY 80
## 5 PAIN MANAGEMENT 79
## 6 PEDIATRIC MEDICINE 72
## 7 NUCLEAR MEDICINE 65
## 8 VASCULAR SURGERY 61
## 9 SLEEP MEDICINE 59
## 10 INTERVENTIONAL CARDIOLOGY 42
##
## # A tibble: 10 x 2
## Secondary.specialty.3 num_provider
## <fct> <int>
## 1 <NA> 132819
## 2 INTERNAL MEDICINE 47
## 3 SLEEP MEDICINE 34
## 4 PERIPHERAL VASCULAR DISEASE 19
## 5 NUCLEAR MEDICINE 18
## 6 VASCULAR SURGERY 13
## 7 PULMONARY DISEASE 12
## 8 MEDICAL ONCOLOGY 11
## 9 PEDIATRIC MEDICINE 11
## 10 SPORTS MEDICINE 9
##
## # A tibble: 10 x 2
## Secondary.specialty.4 num_provider
## <fct> <int>
## 1 <NA> 133021
## 2 SLEEP MEDICINE 5
## 3 INTERNAL MEDICINE 4
## 4 VASCULAR SURGERY 4
## 5 MEDICAL ONCOLOGY 3
## 6 PAIN MANAGEMENT 3
## 7 GENERAL PRACTICE 2
## 8 INTERVENTIONAL PAIN MANAGEMENT 2
## 9 NEUROLOGY 2
## 10 OBSTETRICS/GYNECOLOGY 2
##
## # A tibble: 10 x 2
## All.secondary.specialties num_provider
## <fct> <int>
## 1 <NA> 113925
## 2 INTERNAL MEDICINE 6736
## 3 CARDIOVASCULAR DISEASE (CARDIOLOGY) 1009
## 4 PEDIATRIC MEDICINE 787
## 5 CRITICAL CARE (INTENSIVISTS) 771
## 6 GENERAL SURGERY 581
## 7 FAMILY PRACTICE 458
## 8 INTERVENTIONAL RADIOLOGY 387
## 9 PAIN MANAGEMENT 387
## 10 GERIATRIC MEDICINE 363
##
## # A tibble: 10 x 2
## Organization.legal.name num_provider
## <fct> <int>
## 1 <NA> 13303
## 2 REGENTS OF THE UNIVERSITY OF MICHIGAN 1322
## 3 THE CLEVELAND CLINIC FOUNDATION 1012
## 4 SOUTHERN CALIFORNIA PERMANENTE MEDICAL GROUP 790
## 5 UNIVERSITY OF PITTSBURGH PHYSICIANS 640
## 6 PERMANENTE MEDICAL GROUP INC 630
## 7 IHC HEALTH SERVICES INC 624
## 8 PHYSICIANS REFERRAL SERVICE 574
## 9 NORTH SHORE - LIJ MEDICAL PC 536
## 10 UNIVERSITY OF PENN MEDICAL GROUP 501
##
## # A tibble: 10 x 2
## Group.Practice.PAC.ID num_provider
## <fct> <int>
## 1 <NA> 13303
## 2 3779496856 1322
## 3 1850203555 1012
## 4 6002729175 790
## 5 8729990239 640
## 6 8921910225 630
## 7 1850209420 624
## 8 7911801410 574
## 9 3375701568 536
## 10 6204730955 501
##
## [1] "Number.of.Group.Practice.members"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 16.0 78.0 197.6 288.0 995.0 34259
##
## # A tibble: 10 x 2
## Line.1.Street.Address num_provider
## <fct> <int>
## 1 10685 CARNEGIE AVE 352
## 2 600 N WOLFE ST 189
## 3 1515 HOLCOMBE BLVD CLINICAL LAB 180
## 4 200 1ST ST SW 175
## 5 1515 HOLCOMBE BLVD HEMOPATHOLOGY 172
## 6 9500 EUCLID AVE 167
## 7 111 E 210TH ST 164
## 8 13123 E 16TH AVE 159
## 9 9300 EUCLID AVE 148
## 10 55 FRUIT ST 147
##
## # A tibble: 10 x 2
## Line.2.Street.Address num_provider
## <fct> <int>
## 1 <NA> 99255
## 2 SUITE 100 1613
## 3 SUITE 200 1401
## 4 SUITE 101 863
## 5 SUITE 300 629
## 6 SUITE 1 535
## 7 SUITE 201 501
## 8 SUITE 102 492
## 9 SUITE A 462
## 10 SUITE 2 393
##
## # A tibble: 2 x 2
## Marker.of.address.line.2.suppression num_provider
## <fct> <int>
## 1 <NA> 127520
## 2 Y 5541
##
## # A tibble: 10 x 2
## City num_provider
## <fct> <int>
## 1 HOUSTON 2085
## 2 NEW YORK 1678
## 3 ANN ARBOR 1323
## 4 PHILADELPHIA 1213
## 5 CHICAGO 1206
## 6 PITTSBURGH 1100
## 7 LOS ANGELES 1049
## 8 CLEVELAND 946
## 9 BALTIMORE 932
## 10 DALLAS 847
##
## # A tibble: 10 x 2
## State num_provider
## <fct> <int>
## 1 CA 10754
## 2 TX 9452
## 3 NY 8532
## 4 PA 7891
## 5 FL 7452
## 6 MI 5805
## 7 IL 5507
## 8 OH 5470
## 9 NC 4186
## 10 MA 3760
##
## # A tibble: 10 x 2
## Zip.Code num_provider
## <fct> <int>
## 1 481095000 1128
## 2 441950001 667
## 3 770304000 452
## 4 326103003 311
## 5 110303816 270
## 6 152124756 238
## 7 900950001 230
## 8 212870005 187
## 9 559050001 175
## 10 104672401 164
##
## # A tibble: 10 x 2
## Phone.Number num_provider
## <fct> <int>
## 1 <NA> 18642
## 2 8663892727 404
## 3 2164443475 352
## 4 7136204000 211
## 5 4154761000 209
## 6 7137926313 180
## 7 7137945446 174
## 8 7189204321 162
## 9 7207771234 159
## 10 2164458124 148
##
## # A tibble: 10 x 2
## Hospital.affiliation.CCN.1 num_provider
## <fct> <int>
## 1 <NA> 38797
## 2 230046 1154
## 3 360180 804
## 4 450076 543
## 5 230038 423
## 6 080001 397
## 7 330214 383
## 8 140010 364
## 9 050262 357
## 10 390164 356
##
## # A tibble: 10 x 2
## Hospital.affiliation.LBN.1 num_provider
## <fct> <int>
## 1 <NA> 38970
## 2 UNIVERSITY OF MICHIGAN HEALTH SYSTEM 1154
## 3 CLEVELAND CLINIC 804
## 4 UNIVERSITY OF TEXAS M D ANDERSON CANCER CENTER,THE 543
## 5 SPECTRUM HEALTH - BUTTERWORTH CAMPUS 423
## 6 CHRISTIANA CARE HEALTH SERVICES, INC. 397
## 7 NYU HOSPITALS CENTER 383
## 8 EVANSTON HOSPITAL 364
## 9 RONALD REAGAN U C L A MEDICAL CENTER 357
## 10 UPMC PRESBYTERIAN SHADYSIDE 356
##
## # A tibble: 10 x 2
## Hospital.affiliation.CCN.2 num_provider
## <fct> <int>
## 1 <NA> 84816
## 2 330106 185
## 3 050112 153
## 4 390111 152
## 5 390263 149
## 6 360180 136
## 7 390326 131
## 8 360230 120
## 9 450184 119
## 10 490007 118
##
## # A tibble: 10 x 2
## Hospital.affiliation.LBN.2 num_provider
## <fct> <int>
## 1 <NA> 85085
## 2 NORTH SHORE UNIVERSITY HOSPITAL 185
## 3 ST JOSEPH MEDICAL CENTER 180
## 4 ST FRANCIS HOSPITAL 156
## 5 SANTA MONICA - UCLA MED CTR & ORTHOPAEDIC HOSPITAL 153
## 6 HOSPITAL OF UNIV OF PENNSYLVANIA 152
## 7 LEHIGH VALLEY HOSPITAL - MUHLENBERG 149
## 8 CLEVELAND CLINIC 136
## 9 FAIRVIEW HOSPITAL 131
## 10 ST LUKE'S HOSPITAL - ANDERSON CAMPUS 131
##
## # A tibble: 10 x 2
## Hospital.affiliation.CCN.3 num_provider
## <fct> <int>
## 1 <NA> 107369
## 2 110008 82
## 3 360364 79
## 4 520189 71
## 5 240001 68
## 6 490007 62
## 7 360077 60
## 8 490046 60
## 9 520139 55
## 10 150024 52
##
## # A tibble: 10 x 2
## Hospital.affiliation.LBN.3 num_provider
## <fct> <int>
## 1 <NA> 107595
## 2 NORTHSIDE HOSPITAL CHEROKEE 82
## 3 AURORA MEDICAL CTR KENOSHA 71
## 4 AURORA MEDICAL CENTER 68
## 5 NORTH MEMORIAL MEDICAL CENTER 68
## 6 FAIRVIEW HOSPITAL 63
## 7 SENTARA NORFOLK GENERAL HOSPITAL 62
## 8 GOOD SAMARITAN HOSPITAL 61
## 9 SENTARA LEIGH HOSPITAL 60
## 10 ST JOSEPH'S HOSPITAL 59
##
## # A tibble: 10 x 2
## Hospital.affiliation.CCN.4 num_provider
## <fct> <int>
## 1 <NA> 118454
## 2 050537 57
## 3 360230 53
## 4 520206 48
## 5 490119 45
## 6 260062 41
## 7 450068 38
## 8 360077 37
## 9 490046 36
## 10 150157 34
##
## # A tibble: 10 x 2
## Hospital.affiliation.LBN.4 num_provider
## <fct> <int>
## 1 <NA> 118583
## 2 AURORA MEDICAL CENTER 77
## 3 SUTTER DAVIS HOSPITAL 57
## 4 HILLCREST HOSPITAL 53
## 5 SENTARA PRINCESS ANNE HOSPITAL 45
## 6 SAINT LUKES NORTHLAND HOSPITAL 41
## 7 ST ANTHONY HOSPITAL 41
## 8 GOOD SAMARITAN HOSPITAL 39
## 9 MEMORIAL HERMANN TEXAS MEDICAL CENTER 38
## 10 FAIRVIEW HOSPITAL 37
##
## # A tibble: 10 x 2
## Hospital.affiliation.CCN.5 num_provider
## <fct> <int>
## 1 <NA> 123981
## 2 520207 52
## 3 520139 49
## 4 520206 43
## 5 340060 38
## 6 100314 29
## 7 500077 27
## 8 360230 25
## 9 390183 25
## 10 360143 24
##
## # A tibble: 10 x 2
## Hospital.affiliation.LBN.5 num_provider
## <fct> <int>
## 1 <NA> 124060
## 2 AURORA MEDICAL CENTER 95
## 3 AURORA WEST ALLIS MEDICAL CENTER 49
## 4 MOREHEAD MEMORIAL HOSPITAL 38
## 5 DOCTORS HOSPITAL 31
## 6 WEST KENDALL BAPTIST HOSPITAL 29
## 7 PROVIDENCE HOLY FAMILY HOSPITAL 27
## 8 HILLCREST HOSPITAL 25
## 9 MEMORIAL HOSPITAL 25
## 10 ST LUKE'S MINERS MEMORIAL HOSPITAL 25
##
## # A tibble: 2 x 2
## Professional.accepts.Medicare.Assignment num_provider
## <fct> <int>
## 1 Y 128582
## 2 M 4479
##
## # A tibble: 2 x 2
## Reported.Quality.Measures num_provider
## <fct> <int>
## 1 Y 93356
## 2 <NA> 39705
##
## # A tibble: 2 x 2
## Used.electronic.health.records num_provider
## <fct> <int>
## 1 <NA> 98384
## 2 Y 34677
##
## # A tibble: 2 x 2
## Committed.to.heart.health.through.the.Million.Hearts..init~ num_provider
## <fct> <int>
## 1 <NA> 131953
## 2 Y 1108
##
## [1] "years.after.grad"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 10.00 19.00 20.24 29.00 70.00 330
There are many NA values. How many NA are in each columns? How many providers have a lot of NA values?
col.NA <-
provider %>%
summarise_all(funs(
signif(
sum(is.na(.)) / n(),
digits = 3)))
col.NA <-
as_tibble(cbind(column_names = names(col.NA), t(col.NA)))
names(col.NA)[2] <- c('Prop.NA')
col.NA$Prop.NA <- as.numeric(col.NA$Prop.NA)
col.NA <-
col.NA %>%
arrange(desc(Prop.NA)) %>%
print(n = Inf)
## # A tibble: 42 x 2
## column_names Prop.NA
## <chr> <dbl>
## 1 Secondary.specialty.4 1.00e+0
## 2 Secondary.specialty.3 9.98e-1
## 3 Committed.to.heart.health.through.the.Million.Hearts..initia~ 9.92e-1
## 4 Secondary.specialty.2 9.84e-1
## 5 Suffix 9.81e-1
## 6 Marker.of.address.line.2.suppression 9.58e-1
## 7 Hospital.affiliation.CCN.5 9.32e-1
## 8 Hospital.affiliation.LBN.5 9.32e-1
## 9 Hospital.affiliation.LBN.4 8.91e-1
## 10 Hospital.affiliation.CCN.4 8.90e-1
## 11 Secondary.specialty.1 8.56e-1
## 12 All.secondary.specialties 8.56e-1
## 13 Hospital.affiliation.LBN.3 8.09e-1
## 14 Hospital.affiliation.CCN.3 8.07e-1
## 15 Line.2.Street.Address 7.46e-1
## 16 Used.electronic.health.records 7.39e-1
## 17 Credential 6.62e-1
## 18 Hospital.affiliation.LBN.2 6.39e-1
## 19 Hospital.affiliation.CCN.2 6.37e-1
## 20 Reported.Quality.Measures 2.98e-1
## 21 Hospital.affiliation.LBN.1 2.93e-1
## 22 Hospital.affiliation.CCN.1 2.92e-1
## 23 Number.of.Group.Practice.members 2.57e-1
## 24 Middle.Name 2.45e-1
## 25 Phone.Number 1.40e-1
## 26 Organization.legal.name 1.00e-1
## 27 Group.Practice.PAC.ID 1.00e-1
## 28 Graduation.year 2.48e-3
## 29 years.after.grad 2.48e-3
## 30 First.Name 7.52e-6
## 31 Medical.school.name 7.52e-6
## 32 NPI 0.
## 33 PAC.ID 0.
## 34 Professional.Enrollment.ID 0.
## 35 Last.Name 0.
## 36 Gender 0.
## 37 Primary.specialty 0.
## 38 Line.1.Street.Address 0.
## 39 City 0.
## 40 State 0.
## 41 Zip.Code 0.
## 42 Professional.accepts.Medicare.Assignment 0.
ggplot(col.NA, aes(x = reorder(column_names,Prop.NA), y = Prop.NA)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Proportion of NAs in each column")
There are a bunch of columns that have NA, but not all of them are important. For example, Suffix and Secondary.specialty.4 are not important to have complete information. Doing some feature engineering to consolidate information, such as specialties and hospital affiliations.
Creating the variable has.secondary.specialty to consolidate the secondary specialty columns.
provider <-
provider %>%
mutate(has.secondary.specialty =
factor(ifelse(is.na(All.secondary.specialties), 'N', 'Y')))
provider %>%
group_by(has.secondary.specialty) %>%
dplyr::summarise(n() / nrow(provider))
## # A tibble: 2 x 2
## has.secondary.specialty `n()/nrow(provider)`
## <fct> <dbl>
## 1 N 0.856
## 2 Y 0.144
ggplot(provider,aes(x = has.secondary.specialty)) +
geom_bar() +
labs(title = "Frequency of providers with an additional specialty")
85.6% of the providers in this table do not have a secondary specialty and 14.4% do. Can now remove those columns about Secondary Specialty. This was somewhat surprising to see.
col.remove = c("Secondary.specialty.1", "Secondary.specialty.2",
"Secondary.specialty.3", "Secondary.specialty.4",
"All.secondary.specialties")
# Add columns that we engineered above
col.remove <- list.append(col.remove, "Graduation.year")
cat("These columns are still being considered for analysis: \n",
paste(setdiff(names(provider),col.remove), collapse = ' \n '))
## These columns are still being considered for analysis:
## NPI
## PAC.ID
## Professional.Enrollment.ID
## Last.Name
## First.Name
## Middle.Name
## Suffix
## Gender
## Credential
## Medical.school.name
## Primary.specialty
## Organization.legal.name
## Group.Practice.PAC.ID
## Number.of.Group.Practice.members
## Line.1.Street.Address
## Line.2.Street.Address
## Marker.of.address.line.2.suppression
## City
## State
## Zip.Code
## Phone.Number
## Hospital.affiliation.CCN.1
## Hospital.affiliation.LBN.1
## Hospital.affiliation.CCN.2
## Hospital.affiliation.LBN.2
## Hospital.affiliation.CCN.3
## Hospital.affiliation.LBN.3
## Hospital.affiliation.CCN.4
## Hospital.affiliation.LBN.4
## Hospital.affiliation.CCN.5
## Hospital.affiliation.LBN.5
## Professional.accepts.Medicare.Assignment
## Reported.Quality.Measures
## Used.electronic.health.records
## Committed.to.heart.health.through.the.Million.Hearts..initiative.
## years.after.grad
## has.secondary.specialty
Consolidate hospital affiliation information into has.hospital.affiliation.
provider <-
provider %>%
mutate(has.hospital.affiliation =
factor(ifelse((is.na(Hospital.affiliation.CCN.1) &
is.na(Hospital.affiliation.CCN.2) &
is.na(Hospital.affiliation.CCN.3) &
is.na(Hospital.affiliation.CCN.4) &
is.na(Hospital.affiliation.CCN.5)), 'N','Y')))
provider %>%
group_by(has.hospital.affiliation) %>%
dplyr::summarise(n() / nrow(provider))
## # A tibble: 2 x 2
## has.hospital.affiliation `n()/nrow(provider)`
## <fct> <dbl>
## 1 N 0.292
## 2 Y 0.708
ggplot(provider,aes(x = has.hospital.affiliation)) +
geom_bar() +
labs(title = "Frequency of providers with hospital affiliation")
This was also surprising to see because I expected more providers to operate within their own primary practices.
col.remove <- list.append(col.remove, c("Hospital.affiliation.CCN.1",
"Hospital.affiliation.CCN.2",
"Hospital.affiliation.CCN.3",
"Hospital.affiliation.CCN.4",
"Hospital.affiliation.CCN.5",
"Hospital.affiliation.LBN.1",
"Hospital.affiliation.LBN.2",
"Hospital.affiliation.LBN.3",
"Hospital.affiliation.LBN.4",
"Hospital.affiliation.LBN.5"))
cat("These columns are still being considered for analysis: \n",
paste(setdiff(names(provider),col.remove), collapse = ' \n '))
## These columns are still being considered for analysis:
## NPI
## PAC.ID
## Professional.Enrollment.ID
## Last.Name
## First.Name
## Middle.Name
## Suffix
## Gender
## Credential
## Medical.school.name
## Primary.specialty
## Organization.legal.name
## Group.Practice.PAC.ID
## Number.of.Group.Practice.members
## Line.1.Street.Address
## Line.2.Street.Address
## Marker.of.address.line.2.suppression
## City
## State
## Zip.Code
## Phone.Number
## Professional.accepts.Medicare.Assignment
## Reported.Quality.Measures
## Used.electronic.health.records
## Committed.to.heart.health.through.the.Million.Hearts..initiative.
## years.after.grad
## has.secondary.specialty
## has.hospital.affiliation
Create frequency plots for the other variables if low number of unique values
# Gender
ggplot(provider, aes(x = Gender)) +
geom_bar() +
labs(title = "Frequency of genders")
# Credential
ggplot(provider, aes(x = reorder(Credential,Credential,
function(x)length(x)))) +
geom_bar() +
labs(title = "Frequency of credentials") +
coord_flip()
Many providers have NA for the credentials, which is very strange. I would like to compare providers with no credential (population 1) with providers with credentials (population 2) to see if these populations differ at all.
# [TODO] Complete hypothesis testing
# Medical School Name
provider %>%
group_by(Medical.school.name) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = 10,freq) %>%
ggplot(aes(x = reorder(Medical.school.name,freq), y = freq)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = "Top 10 medical school frequencies")
Due to the high number of unique values in Medical.school.name, this variable may need to be binned into a lower dimensional value set.
topthird.medical.school <-
provider %>%
group_by(Medical.school.name) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = round(.34 * length(unique(provider$Medical.school.name))),freq) %>%
dplyr::select(Medical.school.name)
tailthird.medical.school <-
provider %>%
group_by(Medical.school.name) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = -round(.33 * length(unique(provider$Medical.school.name))),freq) %>%
dplyr::select(Medical.school.name)
# [TODO] Repeat for the other variables with multiple factors
# Primary Specialty
provider %>%
group_by(Primary.specialty) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = 10,freq) %>%
ggplot(aes(x = reorder(Primary.specialty,freq),y = freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 primary specialties")
Binned the lowest third of Primary.specialty into a “Rare” bucket
tailthird.specialty <-
provider %>%
group_by(Primary.specialty) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = -round(.33 * length(unique(provider$Primary.specialty))),freq) %>%
dplyr::select(Primary.specialty)
# Group Practice Members
provider %>%
filter(!is.na(Number.of.Group.Practice.members)) %>%
ggplot(aes(x = Number.of.Group.Practice.members)) +
geom_histogram(binwidth = 50) +
labs(title = "Histogram of group practice members") +
geom_vline(aes(xintercept = mean(Number.of.Group.Practice.members)),
colour = "red") +
geom_text(aes(x=310, label="mean", y=300), colour="red")
Need to normalize this data. Lognormal was chosen to normalize this data over boxcox due to the computational simplicity and satisfactory results.
Refer to this document
provider$Number.of.Group.Practice.members_log <-
log(provider$Number.of.Group.Practice.members)
plotNormalHistogram(provider$Number.of.Group.Practice.members_log)
qqnorm(provider$Number.of.Group.Practice.members_log)
qqline(provider$Number.of.Group.Practice.members_log, col = 'red')
col.remove <- list.append(col.remove, 'Number.of.Group.Practice.members')
Looking at the Q-Q Plot of log normalized Number.of.Group.Practice.members, you can easily see that between [-1,1] theoretical quantiles, the data fits a normal distribution.
# State
provider %>%
group_by(State) %>%
dplyr::summarise(freq = n()) %>%
dplyr::arrange(desc(freq)) %>%
top_n(n = 10,freq) %>%
ggplot(aes(x = reorder(State,freq),y = freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 States")
CA, TX, and NY round out the top 3 states, which makes sense because they are some of the largest states with densely populated cities.
# Location
# Accepts Medicare
ggplot(provider,aes(x = Professional.accepts.Medicare.Assignment)) +
geom_bar() +
labs(title = "Frequency of providers accepting Medicare")
value of ‘M’ indicates a provider maybe accepts Medicare and value of ‘Y’ indicates a provider indeed accepts Medicare.
# Reported Quality Measures
ggplot(provider,aes(x = Reported.Quality.Measures)) +
geom_bar() +
labs(title = "Frequency of providers reporting quality measures")
Unsurprised that this is the distribution of providers reporting quality measures because some NA values are expected here.
# Used electronic health records
ggplot(provider,aes(x = Used.electronic.health.records)) +
geom_bar() +
labs(title = "Frequency of providers using electronic health records")
Did not expect so few providers to be using electronic health records! Since the only two values are ‘Y’ and NA, then the NA values could perhaps be ‘Y’ or some other value. However, instead of imputing a value, these were converted to ‘No Answer’ because there are no ‘N’ values to understand the distribution of actual values.
# Committed to heart health
ggplot(provider,aes(
x = Committed.to.heart.health.through.the.Million.Hearts..initiative.)) +
geom_bar() +
labs(title = "Frequency of providers committed to heart health")
Surprised to see the high ratio of NA:‘Y’ values for this variable. Perhaps not many members utilize this offering
# years after grad
provider %>%
filter(!is.na(years.after.grad)) %>%
ggplot(aes(x = years.after.grad)) +
geom_histogram(binwidth = 2) +
labs(title = "Histogram of years after graduating") +
geom_vline(aes(xintercept = mean(years.after.grad)),colour = "red") +
geom_text(aes(x = 23, label = "mean", y = 500), colour = "red")
Do we need to normalize this data? After checking, possibly not.
plotNormalHistogram(provider$years.after.grad)
qqnorm(provider$years.after.grad)
qqline(provider$years.after.grad, col = "red")
In the above, we found that there were many providers with NA credentials. We’d like to know if there is a relationship between having NA credentials and various other dependent variables. Only some variables were checked to understand if PCA would help reduce correlation between variables.
# Create vector of yes/no NA credentials
provider <-
provider %>%
mutate(credential.isNA =
factor(ifelse(is.na(Credential), 'Y', 'N')))
# Gender
chisq.test(table(provider$Gender,provider$credential.isNA))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(provider$Gender, provider$credential.isNA)
## X-squared = 2912.7, df = 1, p-value < 2.2e-16
With a p-value < 0.05, reject the null hypothesis that states there is no relationship between these variables.
# Medical school name
# fisher.test(table(provider$Medical.school.name,provider$credential.isNA))
# [TODO] Figure out way to bin medical schools
# Graduation year
# fisher.test(table(provider$Graduation.year,provider$credential.isNA))
# Primary specialty
# Group practice members
# State
# Medicare
chisq.test(provider$Professional.accepts.Medicare.Assignment,
provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$Professional.accepts.Medicare.Assignment and provider$credential.isNA
## X-squared = 2.9427, df = 1, p-value = 0.08627
With a p-value > 0.05, do not reject the null hypothesis that states there is no relationship between these variables.
# Quality Measures
provider$Reported.Quality.Measures <-
factor(provider$Reported.Quality.Measures,
levels = levels(addNA(provider$Reported.Quality.Measures)),
labels = c(levels(provider$Reported.Quality.Measures), "No Answer"),
exclude=NULL)
chisq.test(provider$Reported.Quality.Measures,provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$Reported.Quality.Measures and provider$credential.isNA
## X-squared = 1117.2, df = 1, p-value < 2.2e-16
With a p-value < 0.05, reject the null hypothesis that states there is no relationship between these variables.
prop.table(xtabs(~Reported.Quality.Measures + credential.isNA, data = provider))
## credential.isNA
## Reported.Quality.Measures N Y
## Y 0.25669430 0.44490873
## No Answer 0.08091026 0.21748672
Higher proportion of providers have a credential and report quality measures.
# electronic Health records
provider$Used.electronic.health.records <- factor(provider$Used.electronic.health.records, levels = levels(addNA(provider$Used.electronic.health.records)), labels = c(levels(provider$Used.electronic.health.records), "No Answer"),exclude=NULL)
chisq.test(provider$Used.electronic.health.records,provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$Used.electronic.health.records and provider$credential.isNA
## X-squared = 1538.8, df = 1, p-value < 2.2e-16
With a p-value < 0.05, reject the null hypothesis that states there is no relationship between these variables.
prop.table(xtabs(~Used.electronic.health.records + credential.isNA, data = provider))
## credential.isNA
## Used.electronic.health.records N Y
## Y 0.1103103 0.1502995
## No Answer 0.2272942 0.5120960
# heart health
provider$Committed.to.heart.health.through.the.Million.Hearts..initiative. <- factor(provider$Committed.to.heart.health.through.the.Million.Hearts..initiative., levels = levels(addNA(provider$Committed.to.heart.health.through.the.Million.Hearts..initiative.)), labels = c(levels(provider$Committed.to.heart.health.through.the.Million.Hearts..initiative.), "No Answer"),exclude=NULL)
chisq.test(provider$Committed.to.heart.health.through.the.Million.Hearts..initiative.,provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$Committed.to.heart.health.through.the.Million.Hearts..initiative. and provider$credential.isNA
## X-squared = 21.353, df = 1, p-value = 3.821e-06
With a p-value < 0.05, reject the null hypothesis that states there is no relationship between these variables.
prop.table(xtabs(~Committed.to.heart.health.through.the.Million.Hearts..initiative. + credential.isNA, data = provider))
## credential.isNA
## Committed.to.heart.health.through.the.Million.Hearts..initiative. N
## Y 0.003359361
## No Answer 0.334245196
## credential.isNA
## Committed.to.heart.health.through.the.Million.Hearts..initiative. Y
## Y 0.004967646
## No Answer 0.657427796
# secondary specialty
chisq.test(provider$has.secondary.specialty,provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$has.secondary.specialty and provider$credential.isNA
## X-squared = 761.27, df = 1, p-value < 2.2e-16
With a p-value < 0.05, reject the null hypothesis that states there is no relationship between these variables.
prop.table(xtabs(~has.secondary.specialty + credential.isNA, data = provider))
## credential.isNA
## has.secondary.specialty N Y
## N 0.27649725 0.57968901
## Y 0.06110731 0.08270643
# hospital affiliation
chisq.test(provider$has.hospital.affiliation,provider$credential.isNA)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: provider$has.hospital.affiliation and provider$credential.isNA
## X-squared = 777.85, df = 1, p-value < 2.2e-16
prop.table(xtabs(~has.hospital.affiliation + credential.isNA, data = provider))
## credential.isNA
## has.hospital.affiliation N Y
## N 0.08199998 0.20957305
## Y 0.25560457 0.45282239
# remove column `credential.isNA` if moving on
col.remove <- list.append(col.remove,'credential.isNA')
I decided to use graph cluster analysis because I was interested in seeing how hospital networks helped contribute to provider clustering. Below are the pros and cons of using graph cluster analysis for this purpose:
Pros:
* Can analyze many-to-many relationships
Cons:
* Difficult to create the graph
* Difficult to analyze
Refer to this
For the graph cluster analysis portion, I opted to use only the first two columns of hospital associations because building an adjacency matrix was a bit more complicated than creating node and edge lists. However, using an adjacency matrix may have made this a more interesting and robust analysis.
Create node list
hosp1 <- provider %>%
distinct(Hospital.affiliation.LBN.1) %>%
dplyr::rename(label = Hospital.affiliation.LBN.1)
hosp2 <- provider %>%
distinct(Hospital.affiliation.LBN.2) %>%
dplyr::rename(label = Hospital.affiliation.LBN.2)
hosp3 <- provider %>%
distinct(Hospital.affiliation.LBN.3) %>%
dplyr::rename(label = Hospital.affiliation.LBN.3)
hosp4 <- provider %>%
distinct(Hospital.affiliation.LBN.4) %>%
dplyr::rename(label = Hospital.affiliation.LBN.4)
nodes <-
join_all(list(hosp1,hosp2,hosp3,hosp4), by = "label", type = "full") %>%
rowid_to_column("id")
| id | label |
|---|---|
| 1 | NA |
| 2 | LITTLE RIVER HEALTHCARE |
| 3 | ST JOHN HOSPITAL AND MEDICAL CENTER |
| 4 | MEMORIAL HERMANN TEXAS MEDICAL CENTER |
| 5 | SSM HEALTH ST CLARE HOSPITAL - FENTON |
| 6 | HOAG ORTHOPEDIC INSTITUTE |
| 7 | ST ELIZABETH MEDICAL CENTER NORTH |
| 8 | ST JOHNS HOSPITAL |
| 9 | THOMAS JEFFERSON UNIVERSITY HOSPITAL |
| 10 | VALLEYCARE MEDICAL CENTER |
Create edge list
# edges <- provider %>%
# group_by_at(vars(Hospital.affiliation.LBN.1, Hospital.affiliation.LBN.2,
# Hospital.affiliation.LBN.3, Hospital.affiliation.LBN.4)) %>%
# dplyr::summarise(weight = n()) %>%
# ungroup()
edges <- provider %>%
group_by_at(vars(Hospital.affiliation.LBN.1, Hospital.affiliation.LBN.2)) %>%
dplyr::summarise(weight = n()) %>%
ungroup()
edges$Hospital.affiliation.LBN.1 <- mapvalues(edges$Hospital.affiliation.LBN.1,
from = nodes$label,
to = nodes$id,
warn_missing = FALSE)
edges$Hospital.affiliation.LBN.2 <- mapvalues(edges$Hospital.affiliation.LBN.2,
from = nodes$label,
to = nodes$id,
warn_missing = FALSE)
# edges$Hospital.affiliation.LBN.3 <- mapvalues(edges$Hospital.affiliation.LBN.3,
# from = nodes$label,
# to = nodes$id)
# edges$Hospital.affiliation.LBN.4 <- mapvalues(edges$Hospital.affiliation.LBN.4,
# from = nodes$label,
# to = nodes$id)
Remove NA values from edge list.
edges <-
edges %>%
filter_all(all_vars(!is.na(.)))
edges$weight <- as.numeric(edges$weight)
| Hospital.affiliation.LBN.1 | Hospital.affiliation.LBN.2 | weight |
|---|---|---|
| 2170 | 4001 | 1 |
| 2170 | 1817 | 1 |
| 2050 | 231 | 1 |
| 2050 | 1146 | 1 |
| 2050 | 173 | 1 |
| 160 | 158 | 1 |
| 160 | 910 | 5 |
| 160 | 2010 | 7 |
| 160 | 2814 | 1 |
| 160 | 3167 | 2 |
Remove values that have very small weights <10% of max weight.
edges <-
edges %>%
filter(weight >= ceiling(max(edges$weight) * .10))
nrow(edges)
## [1] 408
Remove the corresponding nodes that will not be used.
# rm.nodes <- union(union(union(edges$Hospital.affiliation.LBN.1,edges$Hospital.affiliation.LBN.2),edges$Hospital.affiliation.LBN.3), edges$Hospital.affiliation.LBN.4)
rm.nodes <- union(edges$Hospital.affiliation.LBN.1,edges$Hospital.affiliation.LBN.2)
Number of nodes to keep: 558.
Filter nodes list to keep nodes found in the first two hospital affiliation columns.
nodes <-
nodes %>%
filter(id %in% rm.nodes)
Visualize network
visNetwork(nodes,
setNames(edges,c('from','to', 'value')),
height = "700px", width = "100%") %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = list(style = 'width: 100%; height: 26px;
background: #f8f8f8;
color: darkblue;
border:none;
outline:none;')) %>%
visPhysics(stabilization = FALSE) %>%
visLayout(randomSeed = 100)
Create network object for analysis purposes
hosp_tidy <- tbl_graph(nodes = nodes, edges = setNames(edges,c('from','to', 'value')), directed = FALSE)
hosp_tidy
## # A tbl_graph: 558 nodes and 408 edges
## #
## # An undirected multigraph with 223 components
## #
## # Node Data: 558 x 2 (active)
## id label
## <int> <fct>
## 1 3 ST JOHN HOSPITAL AND MEDICAL CENTER
## 2 4 MEMORIAL HERMANN TEXAS MEDICAL CENTER
## 3 7 ST ELIZABETH MEDICAL CENTER NORTH
## 4 8 ST JOHNS HOSPITAL
## 5 14 METHODIST HOSPITAL,THE
## 6 16 SPECTRUM HEALTH - BUTTERWORTH CAMPUS
## # ... with 552 more rows
## #
## # Edge Data: 408 x 3
## from to value
## <int> <int> <dbl>
## 1 1 2 19
## 2 1 3 19
## 3 1 4 36
## # ... with 405 more rows
Fraction of edges present relative to total possible edges
edge_density(
graph = hosp_tidy,
loops = F
)
## [1] 0.002625432
Fraction of triangles (completely connected 3 nodes) / all triangles
transitivity(
graph = hosp_tidy,
type = 'global' # 'local'
)
## [1] 0.01898734
With a transitivity = 1, the network contains all possible edges. In this case, not all edges are present between hospitals.
Find cliques and give clique sizes
hist(
sapply(
cliques(hosp_tidy),
length
)
) # clique sizes
Cliques are described as a set of nodes where all possible connections between nodes exist, which may indicate a stronger version of community. Source.
As shown here, we have a low number of cliques that form triangles (three node clique). As such, it may be better to look for network structures that are weaker to understand hospital associations.
Cliques with max number of nodes
largest_cliques(hosp_tidy)
## [[1]]
## + 3/558 vertices, from 6e8f8d0:
## [1] 425 422 423
Cluster using various method
The different clustering methods will be evaluated according to modularity and variation of information (VI). VI measures the amount of information lost and gained in changing from one clustering method to another method. In this evaluation, low VI indicates that the clusterings are fairly similar. Modularity measures how dense the connections are between nodes within modules. It looks to see if the number of edges in a cluster is comparable to the number of edges in a cluster found in a random network.
Refer to this article for info on VI.
Refer to this article for information on modularity.
Get the leading Eigen value clusters
According to this document, leading eigenvector community structure is detected by finding “densely connected subgraphs by calculating the leading non-negative eigenvector of the modularity matrix of the graph.”
cls_eigen <-
cluster_leading_eigen(hosp_tidy)
table(
membership(cls_eigen)
)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 7 2 2 2 2 2 3 2 2 2 4 2 2 2 2 4 2 4
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 2 6 3 3 2 3 10 2 3 3 2 2 3 3 3 2 2 2
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 3 2 3 5 5 2 2 2 2 3 2 2 2 2 2 2 3 2
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 3 3 2 2 2 3 2 3 3 2 2 2 3 2 3 2 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 3 2 2 2 4 2 2 2 3 2 4 4 2 5 5 4 2 2
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 2 2 3 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 2 2 2 10 2 2 2 2 2 4 4 2 2 2 2 2 2
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 2 2 2 2 3 2 2 2 2 2 2 3 2 4 2 3 2 2
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 2 2 2 2 2 2 2 3 4 2 3 5 2 2 2 2 2 2
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 5 3 3 3 2 2 2 2 4 3 4 2 2 3 2 2 2 2
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## 2 2 5 2 2 2 2 2 2 2 2 2 2 2 2 2 5 2
## 217 218 219 220 221 222 223
## 2 2 2 3 2 2 2
Modularity of Leading Eigen community finding algorithm is 0.992635.
hist(sizes(cls_eigen))
Get the Louvain clusters
Using the Louvain function to find community structure implements both modularity optimization and a hierarchical approach. Source
cls_louvain <-
cluster_louvain(hosp_tidy)
table(
membership(cls_louvain)
)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 2 2 2 2 2 3 2 2 2 4 2 2 2 2 4 2 4 2
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 3 3 2 3 2 3 3 2 2 3 3 3 2 2 2 3 2 3
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 2 2 2 2 3 6 2 2 2 2 2 2 3 2 3 5 3 2
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 2 2 3 2 3 3 2 2 2 3 2 3 2 2 2 3 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 4 2 2 2 3 2 4 2 5 5 4 2 2 2 2 2 2
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 2 3 2
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 3 2 2 2 2 2 2 2 3 7 2 2 2 2 2 2 2 2
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 2 2 2 2 2 4 2 2 2 2 5 2 2 2 2 2 2
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 4 3 2 2 2 2 2 2 3 2 4 2 3 2 2 2 2 2
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 2 2 2 2 3 4 2 3 5 2 2 2 2 2 2 5 3 3
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 3 2 2 10 2 2 4 3 4 2 2 3 2 2 2 2 10 2
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## 2 5 2 2 2 2 2 2 2 2 2 2 2 2 2 5 2 2
## 217 218 219 220 221 222 223
## 3 2 2 3 2 2 2
Modularity of Louvain community finding algorithm is 0.992635.
hist(sizes(cls_louvain))
Example of one of the largest communities in the hospital network
plot(induced_subgraph(hosp_tidy,cls_louvain[[184]]))
Get the Walktrap clusters
Community structure via short random walks is built on the idea that “short random walks tend to stay in the same community.” Source
cls_wt <-
cluster_walktrap(
hosp_tidy,
steps = 4
)
table(
membership(cls_wt)
)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 10 10 7 6 5 5 5 4 4 5 4 4 4 5 4 4 4 4
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 5 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 3 3 3 3 4 4 4 4 3 3 3 3 3 3 3 3 3 3
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 217 218 219 220 221 222 223
## 2 2 2 2 2 2 2
Modularity of Walktrap community finding algorithm is 0.9926347.
hist(sizes(cls_wt))
Modularity between the three clustering methods are the same and is fairly high, which indicates that the network has “dense connections between nodes within modules but sparse connections between nodes in different modules.” Source
In regard to hospital networks, this may be an accurate analysis because providers in a hospital network may be limited to a certain distance range. Therefore, providers within a certain area may be associated with the hospitals in that region, and distance is the deciding feature.
# compare(
# cls_louvain,
# cls_wt,
# method = 'vi'
# )
# [TODO] Need to change edge list to have 2 columns with from and to.
# [UPDATE] Create adjacency matrix to encompass all of the hospital affiliations columns
# [TODO] Need to figure out a way to remove the NA
# [UPDATE] Removed NA from two column "from" "to" column
# [DONE] Removed values that had NA from edge list
# [TODO] Need to add weights.
# [UPDATE] Changed column name from "weights" to "value" which is what visNetwork wants
# [DONE]
# [TODO] How to handle providers that do not have a hospital affiliation? Maybe just calculate ratio of providers without hospital affiliation and those with hospital affiliation?
# [DONE] Created new feature `has.hospital.affiliation`
# Decide what to do with all NA
provider$Credential <-
factor(provider$Credential,
levels = levels(addNA(provider$Credential)),
labels = c(levels(provider$Credential), "No Answer"),
exclude = NULL)
provider$Medical.school.name <-
factor(provider$Medical.school.name,
levels = levels(addNA(provider$Medical.school.name)),
labels = c(levels(provider$Medical.school.name), "No Answer"),
exclude = NULL)
Impute NA values in numerical columns using mean since the variables have been transformed to a normal distribution.
provider[is.na(provider[,"years.after.grad"]), "years.after.grad"] <-
round(mean(provider$years.after.grad, na.rm = TRUE))
provider[is.na(provider[,"Number.of.Group.Practice.members_log"]),
"Number.of.Group.Practice.members_log"] <-
round(mean(provider$Number.of.Group.Practice.members_log, na.rm = TRUE))
Standardize and center variables from 0-1
scale.center <- function(data) {
output = (data - min(data)) / (max(data) - min(data))
return(output)
}
provider$years.after.grad <- scale.center(provider$years.after.grad)
provider$Number.of.Group.Practice.members_log <- scale.center(provider$Number.of.Group.Practice.members_log)
Bin categorical variables that have too many factors
provider$Medical.school.name <-
as.factor(case_when(provider$Medical.school.name %in% topthird.medical.school[[1]] ~ "Top",
provider$Medical.school.name %in% tailthird.medical.school[[1]] ~ "Bottom",
TRUE ~ "Middle"))
provider %>%
ggplot(aes(x = Medical.school.name)) +
geom_bar() +
coord_flip()
levels(provider$Primary.specialty)[
which(levels(provider$Primary.specialty) %in% tailthird.specialty[[1]])] <-
"RARE SPECIALTY"
provider %>%
ggplot(aes(x = Primary.specialty)) +
geom_bar() +
coord_flip()
Decide which columns to analyze
identity.col <- c("NPI","PAC.ID","Professional.Enrollment.ID","Last.Name","First.Name","Middle.Name",
"Suffix","Organization.legal.name","Group.Practice.PAC.ID","Line.1.Street.Address",
"Line.2.Street.Address", "Marker.of.address.line.2.suppression","City","Zip.Code",
"Phone.Number")
cat("Number of columns left for analysis: ",
ncol(provider) - (length(identity.col) + length(col.remove)))
## Number of columns left for analysis: 13
processed.provider <- provider[ , !(names(provider) %in%
list.append(identity.col,col.remove))]
summary(processed.provider)
## Gender Credential Medical.school.name
## F:57430 No Answer:88139 Bottom: 304
## M:75631 MD :31562 Middle: 10580
## PA : 2766 Top :122177
## NP : 2023
## DO : 1866
## CNA : 1172
## (Other) : 5533
## Primary.specialty State
## NURSE PRACTITIONER :12674 CA :10754
## INTERNAL MEDICINE :11366 TX : 9452
## PHYSICIAN ASSISTANT :11067 NY : 8532
## FAMILY PRACTICE :10057 PA : 7891
## DIAGNOSTIC RADIOLOGY: 8560 FL : 7452
## PHYSICAL THERAPY : 5271 MI : 5805
## (Other) :74066 (Other):83175
## Professional.accepts.Medicare.Assignment Reported.Quality.Measures
## M: 4479 Y :93356
## Y:128582 No Answer:39705
##
##
##
##
##
## Used.electronic.health.records
## Y :34677
## No Answer:98384
##
##
##
##
##
## Committed.to.heart.health.through.the.Million.Hearts..initiative.
## Y : 1108
## No Answer:131953
##
##
##
##
##
## years.after.grad has.secondary.specialty has.hospital.affiliation
## Min. :0.0000 N:113925 N:38797
## 1st Qu.:0.1429 Y: 19136 Y:94264
## Median :0.2714
## Mean :0.2891
## 3rd Qu.:0.4143
## Max. :1.0000
##
## Number.of.Group.Practice.members_log
## Min. :0.0000
## 1st Qu.:0.4306
## Median :0.5325
## Mean :0.5503
## 3rd Qu.:0.7299
## Max. :1.0000
##
Complete one-hot encoding for the categorical variables
encoder <- onehot(processed.provider,max_levels = 350)
encode.processed.provider <- predict(encoder,processed.provider)
str(encode.processed.provider)
## num [1:133061, 1:150] 1 1 0 1 1 0 0 1 0 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:150] "Gender=F" "Gender=M" "Credential=AA" "Credential=AU" ...
Principal component analysis was completed because there was evidence of endogeneity from the hypothesis testing results. Many subsequent machine learning algorithms require independent variables.
pca <- PCA(encode.processed.provider,graph = FALSE)
kable(head(pca$eig, n = 20))
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 4.699112 | 3.1327415 | 3.132741 |
| comp 2 | 3.102769 | 2.0685127 | 5.201254 |
| comp 3 | 2.291904 | 1.5279358 | 6.729190 |
| comp 4 | 2.101767 | 1.4011778 | 8.130368 |
| comp 5 | 2.047152 | 1.3647681 | 9.495136 |
| comp 6 | 1.994164 | 1.3294429 | 10.824579 |
| comp 7 | 1.913033 | 1.2753554 | 12.099934 |
| comp 8 | 1.748977 | 1.1659850 | 13.265919 |
| comp 9 | 1.659097 | 1.1060648 | 14.371984 |
| comp 10 | 1.623727 | 1.0824848 | 15.454469 |
| comp 11 | 1.588917 | 1.0592779 | 16.513747 |
| comp 12 | 1.563208 | 1.0421389 | 17.555885 |
| comp 13 | 1.551110 | 1.0340731 | 18.589959 |
| comp 14 | 1.525877 | 1.0172517 | 19.607210 |
| comp 15 | 1.513438 | 1.0089585 | 20.616169 |
| comp 16 | 1.478635 | 0.9857567 | 21.601926 |
| comp 17 | 1.443629 | 0.9624195 | 22.564345 |
| comp 18 | 1.409683 | 0.9397886 | 23.504133 |
| comp 19 | 1.362085 | 0.9080568 | 24.412190 |
| comp 20 | 1.352184 | 0.9014563 | 25.313647 |
fviz_screeplot(pca, ncp = 150) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Graph of variables
fviz_pca_var(pca, col.var = 'contrib')
In this graph, you can understand variables that correlate positively and negatively. For example, some variables that correlate positively:
* Male gender + MD credential
* Reported quality measures + hospital affiliated
* Female gender + nurse practictioner primary specialty
Graph of individuals
fviz_pca_ind(pca, geom = 'point', col.ind = 'cos2')
Overall, dimensionality reduction using PCA was not helpful because each principal component was responsible for a minute amount of variance. As a result, each variable contributed very little to each principal component. Looking at the scree plot, the elbow is actually closer to 125 principal components because each component explains only 3-1% of variance.
| component | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 125 | 6.440340e-01 | 4.293560e-01 | 95.47559 |
125 component accounts for 95.5% of the variance in the data, so keeping 125 components reduces the number of predictors by 16.7%. This is another function that calculates PCA. For whatever reason, I am only able to get 5 dimensions from the original PCA results instead of the 125 components that would be necessary. This was not used for further analysis but for a quick sanity check. In future iterations, it may be more helpful to go through this route instead.
pca.prcomp <- prcomp(encode.processed.provider, scale. = TRUE)
autoplot(pca.prcomp,encode.processed.provider,
loadings = TRUE,loadings.label = TRUE)
pca.provider <- pca.prcomp$x[,1:125]
K-Means was chosen as a clustering method because of its simplicity. Below are the pros and cons of using this method.
Pros:
* Fast to run
Cons:
* Only works well for spherical clusters
* Difficult to ascertain the number of clusters
* Difficult to work with outliers
Refer to this for more information
Determine number of clusters
wss <- (nrow(encode.processed.provider) - 1)*sum(apply(encode.processed.provider,2,var))
for (i in 2:40) wss[i] <- sum(kmeans(encode.processed.provider,
centers = i,
trace = TRUE)$withinss)
plot(1:40, wss, type = "b", xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Refer to this document
For this scree plot, it is difficult to tell exactly where the elbow is since there is a slow gradient leveling off at 30 clusters. By visual inspection and acknowledging Occam’s razor, 9 clusters was chosen as the elbow and model parameter.
k <- kmeans(x = pca$ind$coord, 9, , nstart=25, iter.max=1000)
plot(pca$ind$coord, col = k$clust,pch = 16)
As you can see, there is much overlap in clusters, which suggests that a Gaussian Mixture Model might be helpful because it allows for mixed cluster membership. Furthermore, this plot is only demonstrating principal component 1 and 2, which account for a low amount of variation in the data. Therefore, this representation must be taken with a grain of salt. In the future, it would be prudent to complete pairplots to analyze clusters across multiple dimensions.
# take 2
# k.prcomp <- kmeans(x = pca.provider, 9 , nstart = 25, iter.max = 1000)
# plot(pca.provider, col = k.prcomp$clust,pch=16)
How can we decipher these clusters?
For interpretation purposes, the most frequent value in each categorical variable and the mean value in each numerical variable is reported.
cluster.processed.provider <- cbind(processed.provider,factor(k$cluster))
for (cluster in sort(unique(k$cluster))) {
temp <- cluster.processed.provider %>%
filter(k$cluster == cluster)
cat("Summarizing cluster: ", cluster)
cat("\n")
# summary.kmeans$`factor(k$cluster)`[iter] <- cluster
for (col in names(temp)) {
if (class(temp[[col]]) != "numerical") {
print(paste(col, names(which.max(table(temp[[col]]))), sep = ": "))
}
else {
print(paste(col, mean(temp[[col]]), sep =": "))
}
}
cat("\n")
}
## Summarizing cluster: 1
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: INTERNAL MEDICINE"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.228571428571429"
## [1] "has.secondary.specialty: Y"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 1"
##
## Summarizing cluster: 2
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: PHYSICAL THERAPY"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.257142857142857"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: N"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 2"
##
## Summarizing cluster: 3
## [1] "Gender: F"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: NURSE PRACTITIONER"
## [1] "State: TX"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.0285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: N"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 3"
##
## Summarizing cluster: 4
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Middle"
## [1] "Primary.specialty: CHIROPRACTIC"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.257142857142857"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: N"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 4"
##
## Summarizing cluster: 5
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: FAMILY PRACTICE"
## [1] "State: FL"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: Y"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: Y"
## [1] "years.after.grad: 0.285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 5"
##
## Summarizing cluster: 6
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: FAMILY PRACTICE"
## [1] "State: PA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: Y"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 6"
##
## Summarizing cluster: 7
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: DIAGNOSTIC RADIOLOGY"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 7"
##
## Summarizing cluster: 8
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Middle"
## [1] "Primary.specialty: FAMILY PRACTICE"
## [1] "State: OH"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.242857142857143"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 8"
##
## Summarizing cluster: 9
## [1] "Gender: F"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: NURSE PRACTITIONER"
## [1] "State: PA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.114285714285714"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(k$cluster): 9"
K-Medoid was chosen as the next choice for clustering because K-Means worked relatively well. K-Medoid is implemented through the R function, partitioning around medoids (PAM), which “minimizes a sum of dissimilarities instead of a sum of squared euclidean distance.” Source
PAM works with medoids (samples of the dataset that represents the group) while K-Means works with centroids (artificially created entities that represent the cluster). As such, PAM could be more representative of the actual dataset.
In PAM, PCA is completed internally, so the encoded provider data was used instead of the PCA provider data. Furthermore, this data was sampled to fit into the memory allocation needed and to speed up run time.
sample.encode.processed.provider <-
sample_n(as.data.frame(encode.processed.provider),
size = ceiling(.1*nrow(encode.processed.provider)),
replace = TRUE)
pamx <- pam(x = sample.encode.processed.provider, 9)
# summary(pamx)
How do these clusters differ?
To decipher, remove all columns from the medoid dataframe that are all 0’s. Print out the medoids.
pam.medoid <- as.data.frame(pamx$medoids)
pam.col.list = c()
for (col in names(pam.medoid)) {
if (any(pam.medoid[[col]]) != 0) {
print(col)
pam.col.list <- list.append(pam.col.list,col)
}
}
## [1] "Gender=F"
## [1] "Gender=M"
## [1] "Credential=MD"
## [1] "Credential=No Answer"
## [1] "Medical.school.name=Top"
## [1] "Primary.specialty=DIAGNOSTIC RADIOLOGY"
## [1] "Primary.specialty=FAMILY PRACTICE"
## [1] "Primary.specialty=INTERNAL MEDICINE"
## [1] "Primary.specialty=NURSE PRACTITIONER"
## [1] "Primary.specialty=PHYSICAL THERAPY"
## [1] "Primary.specialty=PHYSICIAN ASSISTANT"
## [1] "State=CA"
## [1] "State=FL"
## [1] "State=IL"
## [1] "State=NY"
## [1] "State=PA"
## [1] "State=TX"
## [1] "Professional.accepts.Medicare.Assignment=Y"
## [1] "Reported.Quality.Measures=Y"
## [1] "Reported.Quality.Measures=No Answer"
## [1] "Used.electronic.health.records=Y"
## [1] "Used.electronic.health.records=No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.=No Answer"
## [1] "years.after.grad"
## [1] "has.secondary.specialty=N"
## [1] "has.hospital.affiliation=N"
## [1] "has.hospital.affiliation=Y"
## [1] "Number.of.Group.Practice.members_log"
pam.medoid[pam.col.list]
## Gender=F Gender=M Credential=MD Credential=No Answer
## 60319 0 1 0 1
## 88693 0 1 0 1
## 109463 0 1 1 0
## 100123 0 1 1 0
## 102081 0 1 0 1
## 113620 0 1 0 1
## 111678 1 0 0 1
## 100867 1 0 0 1
## 121706 1 0 0 1
## Medical.school.name=Top Primary.specialty=DIAGNOSTIC RADIOLOGY
## 60319 1 0
## 88693 1 0
## 109463 1 0
## 100123 1 1
## 102081 1 0
## 113620 1 1
## 111678 1 0
## 100867 1 0
## 121706 1 0
## Primary.specialty=FAMILY PRACTICE
## 60319 0
## 88693 1
## 109463 0
## 100123 0
## 102081 0
## 113620 0
## 111678 0
## 100867 0
## 121706 0
## Primary.specialty=INTERNAL MEDICINE
## 60319 0
## 88693 0
## 109463 1
## 100123 0
## 102081 1
## 113620 0
## 111678 0
## 100867 0
## 121706 0
## Primary.specialty=NURSE PRACTITIONER
## 60319 0
## 88693 0
## 109463 0
## 100123 0
## 102081 0
## 113620 0
## 111678 0
## 100867 1
## 121706 1
## Primary.specialty=PHYSICAL THERAPY
## 60319 1
## 88693 0
## 109463 0
## 100123 0
## 102081 0
## 113620 0
## 111678 0
## 100867 0
## 121706 0
## Primary.specialty=PHYSICIAN ASSISTANT State=CA State=FL State=IL
## 60319 0 1 0 0
## 88693 0 0 1 0
## 109463 0 0 0 1
## 100123 0 0 0 0
## 102081 0 1 0 0
## 113620 0 0 0 0
## 111678 1 1 0 0
## 100867 0 0 1 0
## 121706 0 0 0 0
## State=NY State=PA State=TX
## 60319 0 0 0
## 88693 0 0 0
## 109463 0 0 0
## 100123 1 0 0
## 102081 0 0 0
## 113620 0 0 1
## 111678 0 0 0
## 100867 0 0 0
## 121706 0 1 0
## Professional.accepts.Medicare.Assignment=Y
## 60319 1
## 88693 1
## 109463 1
## 100123 1
## 102081 1
## 113620 1
## 111678 1
## 100867 1
## 121706 1
## Reported.Quality.Measures=Y Reported.Quality.Measures=No Answer
## 60319 0 1
## 88693 1 0
## 109463 1 0
## 100123 1 0
## 102081 0 1
## 113620 1 0
## 111678 0 1
## 100867 1 0
## 121706 1 0
## Used.electronic.health.records=Y
## 60319 0
## 88693 1
## 109463 1
## 100123 0
## 102081 0
## 113620 0
## 111678 0
## 100867 0
## 121706 0
## Used.electronic.health.records=No Answer
## 60319 1
## 88693 0
## 109463 0
## 100123 1
## 102081 1
## 113620 1
## 111678 1
## 100867 1
## 121706 1
## Committed.to.heart.health.through.the.Million.Hearts..initiative.=No Answer
## 60319 1
## 88693 1
## 109463 1
## 100123 1
## 102081 1
## 113620 1
## 111678 1
## 100867 1
## 121706 1
## years.after.grad has.secondary.specialty=N
## 60319 0.2714286 1
## 88693 0.2857143 1
## 109463 0.3857143 1
## 100123 0.4571429 1
## 102081 0.3428571 1
## 113620 0.2285714 1
## 111678 0.2000000 1
## 100867 0.1000000 1
## 121706 0.1571429 1
## has.hospital.affiliation=N has.hospital.affiliation=Y
## 60319 1 0
## 88693 0 1
## 109463 0 1
## 100123 0 1
## 102081 0 1
## 113620 0 1
## 111678 1 0
## 100867 1 0
## 121706 0 1
## Number.of.Group.Practice.members_log
## 60319 0.3348755
## 88693 0.6130291
## 109463 0.5325392
## 100123 0.5325392
## 102081 0.5325392
## 113620 0.5899839
## 111678 0.5325392
## 100867 0.5394722
## 121706 0.5999897
Agglomerative hierarchical clustering was initially chosen because it is more informative than the flat unstructured clusters from K-Means and for its ease of implementation. However, some cons I experienced were that hierarchical clustering was not suitable for large datasets and since points assigned to a cluster cannot be moved around, order of the data and the initial seeds have a strong impact. Source
# d <- dist(pca$ind$coord, method = 'euclidean')
# hc1 <- hclust(d, method = 'complete')
# plot(hc1,cex=0.6,hang=-1)
#
# hc2 <- agnes(pca$ind$coord, method = 'complete')
Using the traditional hclust function with a distance matrix was unsuccessful due to the high computational complexity and memory allocation.
# Trying HCPC function which completes hierarchical clustering on Principle Components (NCPC)
# hc <- HCPC(pca.provider, nb.clust = -1)
In future work, it may be helpful to sample a smaller proportion of the data initially so that AHC may be used in this analysis.
Gaussian Mixture Model is a parametric model that assumes that the data points are generated from Gaussian distributions.
Refer to this document. and this.
fit <- Mclust(pca$ind$coord)
fviz_mclust(
fit,
what = 'BIC',
palette = 'npg'
)
VVV is the best fit with 9 clusters
summary(fit, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## -839817.8 133061 188 -1681854 -1696174
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 27287 22169 19422 16549 5570 8991 18113 9773 5187
##
## Mixing probabilities:
## 1 2 3 4 5 6
## 0.21320978 0.15955173 0.15081516 0.12348042 0.03988591 0.06621059
## 7 8 9
## 0.13471645 0.07423848 0.03789147
##
## Means:
## [,1] [,2] [,3] [,4] [,5] [,6]
## Dim.1 -0.25760612 -0.1649629 -0.7364784 2.5916190 -2.2232883 -3.3607102
## Dim.2 -0.34537883 -1.8464567 2.5844256 0.4369785 -1.9238568 -0.6434179
## Dim.3 0.17933025 -0.7407867 -1.5036058 2.0083183 -0.2571089 0.8413845
## Dim.4 0.88464926 0.1298260 -0.9907032 -1.8073077 -0.7229012 -0.5603507
## Dim.5 -0.04524785 -0.5797503 -0.2338069 -0.8135371 -0.2788809 0.8173275
## [,7] [,8] [,9]
## Dim.1 2.2914168 -2.5022949 1.5985047
## Dim.2 -0.1920530 0.8658863 0.1435979
## Dim.3 -0.8485591 1.6433212 0.1478182
## Dim.4 0.5142000 0.9033222 2.4504885
## Dim.5 0.7971579 1.2553416 -0.1507534
##
## Variances:
## [,,1]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 1.4880038 0.3298327 -0.16716791 0.49081804 0.14344023
## Dim.2 0.3298327 1.2814949 0.57331116 0.35400337 0.46001191
## Dim.3 -0.1671679 0.5733112 0.51660921 0.09952362 0.03416554
## Dim.4 0.4908180 0.3540034 0.09952362 0.89982466 -0.03748597
## Dim.5 0.1434402 0.4600119 0.03416554 -0.03748597 0.66178017
## [,,2]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.439418329 0.55369334 0.16793108 0.46813427 -0.008344946
## Dim.2 0.553693339 0.81268353 0.26695259 0.63406504 0.017131371
## Dim.3 0.167931083 0.26695259 0.11759047 0.20620032 0.005170710
## Dim.4 0.468134273 0.63406504 0.20620032 0.61224067 -0.032608858
## Dim.5 -0.008344946 0.01713137 0.00517071 -0.03260886 0.079758532
## [,,3]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 5.14889983 -1.7905766 -0.05424913 -0.65669301 1.9504437
## Dim.2 -1.79057656 3.9964580 -1.02330422 0.46568008 -1.6702216
## Dim.3 -0.05424913 -1.0233042 4.42626727 0.01071176 -0.4474119
## Dim.4 -0.65669301 0.4656801 0.01071176 2.72969893 -2.3507060
## Dim.5 1.95044373 -1.6702216 -0.44741187 -2.35070601 8.9138354
## [,,4]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 1.8659284 0.3028319 -0.4165171 0.2332581 0.5094348
## Dim.2 0.3028319 0.8075265 0.3696331 0.6083532 0.2810485
## Dim.3 -0.4165171 0.3696331 0.5516710 0.2561485 -0.1577904
## Dim.4 0.2332581 0.6083532 0.2561485 1.0899806 0.1470555
## Dim.5 0.5094348 0.2810485 -0.1577904 0.1470555 0.5016593
## [,,5]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.0288076194 -0.01747828 -0.0008022328 0.02824471 -0.04963598
## Dim.2 -0.0174782787 0.05801973 0.0168346545 0.02311488 0.04100444
## Dim.3 -0.0008022328 0.01683465 0.0515012130 0.01302344 -0.06906281
## Dim.4 0.0282447145 0.02311488 0.0130234371 0.12016781 -0.02578051
## Dim.5 -0.0496359763 0.04100444 -0.0690628108 -0.02578051 0.33002261
## [,,6]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.37197524 -0.2019677 -0.2500266 0.02853637 0.04901762
## Dim.2 -0.20196768 0.3478089 0.3625857 0.14199830 0.15141751
## Dim.3 -0.25002663 0.3625857 0.4245879 0.13988543 0.14810334
## Dim.4 0.02853637 0.1419983 0.1398854 0.13959581 0.11634576
## Dim.5 0.04901762 0.1514175 0.1481033 0.11634576 0.19661086
## [,,7]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.49034869 0.46598755 0.24243094 0.5022206 0.06229629
## Dim.2 0.46598755 0.59217575 0.27685526 0.5747843 0.08894238
## Dim.3 0.24243094 0.27685526 0.24431227 0.2049949 0.02300664
## Dim.4 0.50222057 0.57478429 0.20499489 0.8146556 0.14663959
## Dim.5 0.06229629 0.08894238 0.02300664 0.1466396 0.08917083
## [,,8]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.91738451 0.03440719 -0.5931789 0.1601886 -0.35438522
## Dim.2 0.03440719 0.71171383 0.5504263 0.5206445 0.08671913
## Dim.3 -0.59317894 0.55042629 1.0899336 0.5121153 0.55628477
## Dim.4 0.16018855 0.52064451 0.5121153 0.6885001 0.33544012
## Dim.5 -0.35438522 0.08671913 0.5562848 0.3354401 0.69065602
## [,,9]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Dim.1 0.033001333 0.001093541 -0.006271898 0.01381299 0.009502095
## Dim.2 0.001093541 0.071583927 0.037006757 0.03204437 0.036003907
## Dim.3 -0.006271898 0.037006757 0.055215151 0.02447752 -0.006118363
## Dim.4 0.013812991 0.032044372 0.024477518 0.12654093 -0.022250622
## Dim.5 0.009502095 0.036003907 -0.006118363 -0.02225062 0.072595232
What is the uncertainty associated with the classification prediction?
summary(fit$uncertainty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.001102 0.005195 0.046204 0.033957 0.649018
boxplot(fit$uncertainty, horizontal = TRUE)
There seem to be a high number of outliers, so there may be much skewness in the data.
Interpret clusters
gmm.cluster.processed.provider <- cbind(processed.provider,factor(fit$classification))
for (cluster in sort(unique(fit$classification))) {
temp <- gmm.cluster.processed.provider %>%
filter(fit$classification == cluster)
cat("Summarizing cluster: ", cluster)
cat("\n")
for (col in names(temp)) {
if (class(temp[[col]]) != "numerical") {
print(paste(col, names(which.max(table(temp[[col]]))), sep = ": "))
}
else {
print(paste(col, mean(temp[[col]]), sep = ": "))
}
}
cat("\n")
}
## Summarizing cluster: 1
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: PHYSICIAN ASSISTANT"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.2"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 1"
##
## Summarizing cluster: 2
## [1] "Gender: F"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: NURSE PRACTITIONER"
## [1] "State: TX"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.114285714285714"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 2"
##
## Summarizing cluster: 3
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Middle"
## [1] "Primary.specialty: CHIROPRACTIC"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.242857142857143"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 3"
##
## Summarizing cluster: 4
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: INTERNAL MEDICINE"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.228571428571429"
## [1] "has.secondary.specialty: Y"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 4"
##
## Summarizing cluster: 5
## [1] "Gender: F"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: NURSE PRACTITIONER"
## [1] "State: TX"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.0285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 5"
##
## Summarizing cluster: 6
## [1] "Gender: F"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: PHYSICAL THERAPY"
## [1] "State: NY"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.0285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: N"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 6"
##
## Summarizing cluster: 7
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: FAMILY PRACTICE"
## [1] "State: PA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: Y"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 7"
##
## Summarizing cluster: 8
## [1] "Gender: M"
## [1] "Credential: No Answer"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: PHYSICAL THERAPY"
## [1] "State: CA"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: No Answer"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.257142857142857"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: N"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 8"
##
## Summarizing cluster: 9
## [1] "Gender: M"
## [1] "Credential: MD"
## [1] "Medical.school.name: Top"
## [1] "Primary.specialty: DIAGNOSTIC RADIOLOGY"
## [1] "State: TX"
## [1] "Professional.accepts.Medicare.Assignment: Y"
## [1] "Reported.Quality.Measures: Y"
## [1] "Used.electronic.health.records: No Answer"
## [1] "Committed.to.heart.health.through.the.Million.Hearts..initiative.: No Answer"
## [1] "years.after.grad: 0.285714285714286"
## [1] "has.secondary.specialty: N"
## [1] "has.hospital.affiliation: Y"
## [1] "Number.of.Group.Practice.members_log: 0.532539162864805"
## [1] "factor(fit$classification): 9"
Since we have unlabeled data, the Calinski-Harabaz Index will be used to evaluate the models. The higher the metric, the more dense and well separated the clusters. Source
| Network Analysis | K-Means | K-Medoid | Agglomerative Hierarchical Clustering | Gaussian Mixture Model |
|---|---|---|---|---|
| NA | 6741.78156 | 1.497815210^{5} | NA | 5447.8494992 |
Using the model comparison results above, it is clear that K-Medoid (PAM) has the highest Calinski-Harabaz Index and demonstrates better clustering. However, this data is using 1.330710^{4} observations whereas the other models are using 133061 observations, which skews the validity of this comparison. In the future, it may be helpful to have sampled an even smaller amount from the original population data from CMS.
For this analysis, the Gaussian Mixture Model would be chosen as the best model because the characteristics of this model suits our purpose the most. Indeed, after looking through the PCA results and cluster plots, there is much overlap in clusters which assumes mixed assignment of clusters. Indeed, I suspect that there are too few characteristics/variables in our dataset that could help discern more definitive clusters.
K-Means and K-Medoid could be viable options for a possible model. However, K-Medoid only took a small portion of the data, so this model would be difficult to scale and encompass more provider features and observations.
The network analysis would be interesting to pursue further especially if the full adjacency matrix were used to create the network. Another piece of future work could include adding a binary variable, included.in.hospital.network, as a dataset feature.